import pandas as pd
df = pd.read_csv('lightcast_job_postings.csv')
df.head()
   ID                                        POSTED      EXPIRED     DURATION  SOURCE_TYPES      SOURCES                               ...  NAICS_2022_2_NAME                                  NAICS_2022_6  NAICS_2022_6_NAME
0  1f57d95acf4dc67ed2819eb12f049f6a5c11782c  2024-06-02  2024-06-08       6.0  ["Company"]       ["brassring.com"]                     ...  Retail Trade                                             441330  Automotive Parts and Accessories Retailers
1  0cb072af26757b6c4ea9464472a50a443af681ac  2024-06-02  2024-08-01       NaN  ["Job Board"]     ["maine.gov"]                         ...  Administrative and Support and Waste Managemen...        561320  Temporary Help Services
2  85318b12b3331fa490d32ad014379df01855c557  2024-06-02  2024-07-07      35.0  ["Job Board"]     ["dejobs.org"]                        ...  Finance and Insurance                                    524291  Claims Adjusting
3  1b5c3941e54a1889ef4f8ae55b401a550708a310  2024-06-02  2024-07-20      48.0  ["Job Board"]     ["disabledperson.com", "dejobs.org"]  ...  Finance and Insurance                                    522110  Commercial Banking
4  cb5ca25f02bdf25c13edfede7931508bfd9e858f  2024-06-02  2024-06-17      15.0  ["FreeJobBoard"]  ["craigslist.org"]                    ...  Unclassified Industry                                    999999  Unclassified Industry

5 rows × 131 columns
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
df.info()
df.isna().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72476 entries, 0 to 72475
Columns: 131 entries, ID to NAICS_2022_6_NAME
dtypes: bool(2), float64(11), int64(27), object(91)
memory usage: 71.5+ MB
ID 0
LAST_UPDATED_DATE 0
LAST_UPDATED_TIMESTAMP 0
DUPLICATES 0
POSTED 0
..
NAICS_2022_4_NAME 0
NAICS_2022_5 0
NAICS_2022_5_NAME 0
NAICS_2022_6 0
NAICS_2022_6_NAME 0
Length: 131, dtype: int64
df['TITLE_NAME'] = df['TITLE_NAME'].astype(str)
df['IS_AI_ROLE'] = df['TITLE_NAME'].str.lower().str.contains(
    'data|ai|machine learning|ml|artificial intelligence'
).astype(int)
df['IS_AI_ROLE'].value_counts()
IS_AI_ROLE
0 48310
1 24166
Name: count, dtype: int64
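One caveat about the keyword flag above: bare substrings like 'ai' and 'ml' also match inside unrelated words ("maintenance", "html"), which can inflate the AI-role count. A minimal sketch on hypothetical job titles, contrasting the substring pattern with a word-boundary version:

```python
import pandas as pd

# Hypothetical titles chosen to illustrate the over-matching risk.
titles = pd.Series(["Data Analyst", "Maintenance Technician", "HTML Developer", "ML Engineer"])

# Bare substrings: 'ai' matches inside "maintenance", 'ml' inside "html".
naive = titles.str.lower().str.contains('data|ai|machine learning|ml|artificial intelligence')

# Word boundaries restrict matching to whole tokens.
strict = titles.str.lower().str.contains(r'\b(?:data|ai|machine learning|ml|artificial intelligence)\b')

print(naive.tolist())   # [True, True, True, True]
print(strict.tolist())  # [True, False, False, True]
```

Re-running the notebook with the word-boundary pattern would likely shift the 0/1 split reported above.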
features = ['REMOTE_TYPE_NAME', 'EDUCATION_LEVELS_NAME', 'NAICS_2022_2_NAME', 'MAX_YEARS_EXPERIENCE']
df_model = df[features + ['IS_AI_ROLE']].dropna()
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
X = df_model[features]
y = df_model['IS_AI_ROLE']
# Note: MAX_YEARS_EXPERIENCE is numeric but is one-hot encoded here along with
# the categorical columns; the second model below treats it separately.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[865 175]
[354 292]]
precision recall f1-score support
0 0.71 0.83 0.77 1040
1 0.63 0.45 0.52 646
accuracy 0.69 1686
macro avg 0.67 0.64 0.65 1686
weighted avg 0.68 0.69 0.67 1686
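Recall for the AI class (0.45) lags well behind the majority class, consistent with the imbalanced test split (646 vs. 1040). One common mitigation is `class_weight='balanced'`, sketched here on synthetic data standing in for the encoded job features (not the notebook's actual matrix):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic imbalanced data (~2:1), mimicking the notebook's class ratio.
X, y = make_classification(n_samples=5000, weights=[0.67, 0.33], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# Balanced weights typically trade some precision for higher minority recall.
print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, weighted.predict(X_te)))
```

Whether the trade-off helps depends on whether missed AI roles or false alarms are costlier for the downstream use.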
import plotly.express as px
y_probs = clf.predict_proba(X_test)[:, 1]
fig = px.histogram(x=y_probs, nbins=50, title="Predicted Probability of AI Job", labels={'x': 'Probability'})
fig.show()
This histogram shows the predicted probabilities of jobs being AI-related, with most values falling between 0.2 and 0.6, indicating that the model has moderate confidence in distinguishing AI from non-AI roles.
import numpy as np
features_cat = ['REMOTE_TYPE_NAME', 'EDUCATION_LEVELS_NAME', 'NAICS_2022_2_NAME']
features_num = ['MAX_YEARS_EXPERIENCE', 'DURATION']
df_model = df[features_cat + features_num + ['IS_AI_ROLE']].dropna()
X_cat = df_model[features_cat]
X_num = df_model[features_num]
y = df_model['IS_AI_ROLE']
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_cat)
X_full = np.hstack((X_cat_encoded, X_num.values))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[498 97]
[250 163]]
precision recall f1-score support
0 0.67 0.84 0.74 595
1 0.63 0.39 0.48 413
accuracy 0.66 1008
macro avg 0.65 0.62 0.61 1008
weighted avg 0.65 0.66 0.64 1008
import plotly.express as px
y_probs = clf.predict_proba(X_test)[:, 1]
fig = px.histogram(x=y_probs, nbins=50, title="Predicted Probability of AI Job (Enhanced Features)", labels={'x': 'Probability'})
fig.show()
This histogram displays the predicted probabilities of jobs being AI-related using enhanced features, showing a concentration around 0.3 to 0.6, which suggests the model still struggles to confidently separate AI from non-AI positions.
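The manual `np.hstack` of one-hot dummies and raw numeric columns works, but a `Pipeline` with a `ColumnTransformer` keeps encoding and scaling bundled with the model (and scales the numerics, which the hstack approach skips). A minimal sketch on a tiny hypothetical frame whose column names mirror the notebook's features:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Tiny stand-in data; values are illustrative, not from the dataset.
toy = pd.DataFrame({
    'REMOTE_TYPE_NAME': ['Remote', 'On-site', 'Remote', 'Hybrid'],
    'MAX_YEARS_EXPERIENCE': [2.0, 10.0, 5.0, 3.0],
    'DURATION': [10.0, 40.0, 25.0, 15.0],
    'IS_AI_ROLE': [1, 0, 1, 0],
})
cols = ['REMOTE_TYPE_NAME', 'MAX_YEARS_EXPERIENCE', 'DURATION']

pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['REMOTE_TYPE_NAME']),
    ('num', StandardScaler(), ['MAX_YEARS_EXPERIENCE', 'DURATION']),
])
pipe = Pipeline([('pre', pre), ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(toy[cols], toy['IS_AI_ROLE'])
print(pipe.predict(toy[cols]))
```

The same `pipe` object can then be fit on `df_model[features_cat + features_num]` directly, with no manual encoding or stacking.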
df['SALARY_FROM'] = pd.to_numeric(df['SALARY_FROM'], errors='coerce')
df['SALARY_TO'] = pd.to_numeric(df['SALARY_TO'], errors='coerce')
df['AVG_SALARY'] = (df['SALARY_FROM'] + df['SALARY_TO']) / 2
features_cat = ['REMOTE_TYPE_NAME', 'EDUCATION_LEVELS_NAME', 'NAICS_2022_2_NAME']
features_num = ['MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']
target = 'AVG_SALARY'
df_reg = df[features_cat + features_num + [target]].dropna()
X_cat = df_reg[features_cat]
X_num = df_reg[features_num]
y = df_reg[target]
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_cat)
import numpy as np
X_full = np.hstack((X_cat_encoded, X_num.values))
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")
RMSE: 27763.52
R² Score: 0.409
import plotly.express as px
import pandas as pd
df_plot = pd.DataFrame({
    'Actual Salary': y_test,
    'Predicted Salary': y_pred
})
fig = px.scatter(df_plot, x='Actual Salary', y='Predicted Salary', trendline='ols',
                 title='Actual vs. Predicted Salary')
fig.show()
This scatter plot compares actual vs. predicted salaries from the regression model. The trendline shows a clear positive correlation, but with R² around 0.41 the predictions scatter widely, and deviations grow at higher salary levels.
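The growing errors at high salaries are typical of right-skewed targets (the OLS summary later in the notebook reports skew above 5). A common remedy is to regress on log-salary and exponentiate predictions back; a minimal sketch on synthetic skewed data, not the actual postings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Synthetic right-skewed "salaries" driven by one feature (e.g. experience).
x = rng.uniform(0, 10, size=(500, 1))
salary = np.exp(10 + 0.1 * x[:, 0] + rng.normal(0, 0.3, 500))

# Fit on log(salary); multiplicative effects become additive and the
# residuals are far closer to symmetric.
reg = LinearRegression().fit(x, np.log(salary))
pred_dollars = np.exp(reg.predict(x))  # back-transform to dollar scale
print(r2_score(np.log(salary), reg.predict(x)))
```

On the real data this would mean fitting on `np.log(df_reg['AVG_SALARY'])` and comparing residual plots against the untransformed fit.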
import statsmodels.api as sm
X_cat = df_reg[features_cat]
X_num = df_reg[features_num]
y = df_reg[target]
X_cat_encoded = encoder.fit_transform(X_cat)
X_full = np.hstack((X_cat_encoded, X_num.values))
X_full_const = sm.add_constant(X_full)
model = sm.OLS(y, X_full_const).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: AVG_SALARY R-squared: 0.419
Model: OLS Adj. R-squared: 0.408
Method: Least Squares F-statistic: 38.26
Date: Fri, 02 May 2025 Prob (F-statistic): 1.03e-233
Time: 20:19:54 Log-Likelihood: -27208.
No. Observations: 2325 AIC: 5.450e+04
Df Residuals: 2281 BIC: 5.476e+04
Df Model: 43
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.844e+04 3076.562 15.744 0.000 4.24e+04 5.45e+04
x1 6185.2580 3901.325 1.585 0.113 -1465.258 1.38e+04
x2 6240.6035 4529.491 1.378 0.168 -2641.748 1.51e+04
x3 1.964e+04 1930.887 10.172 0.000 1.59e+04 2.34e+04
x4 1.637e+04 1749.327 9.359 0.000 1.29e+04 1.98e+04
x5 -1.815e+04 8064.455 -2.250 0.025 -3.4e+04 -2334.413
x6 -7593.2178 6654.948 -1.141 0.254 -2.06e+04 5457.165
x7 2088.7084 1.29e+04 0.162 0.871 -2.32e+04 2.73e+04
x8 -5.321e+04 2.82e+04 -1.887 0.059 -1.08e+05 2089.448
x9 8243.6458 3043.062 2.709 0.007 2276.187 1.42e+04
x10 2.211e+04 3301.251 6.698 0.000 1.56e+04 2.86e+04
x11 2.888e+04 4903.334 5.889 0.000 1.93e+04 3.85e+04
x12 1.649e+04 9201.957 1.792 0.073 -1550.958 3.45e+04
x13 -2.381e+04 5411.642 -4.399 0.000 -3.44e+04 -1.32e+04
x14 1.659e+04 1.65e+04 1.008 0.314 -1.57e+04 4.89e+04
x15 -1.905e+04 1.3e+04 -1.468 0.142 -4.45e+04 6388.395
x16 -1.674e+04 2.2e+04 -0.759 0.448 -6e+04 2.65e+04
x17 -2.524e+04 4699.026 -5.371 0.000 -3.45e+04 -1.6e+04
x18 2895.4505 1.43e+04 0.203 0.839 -2.51e+04 3.09e+04
x19 2.791e+04 7087.566 3.938 0.000 1.4e+04 4.18e+04
x20 3.086e+04 1.19e+04 2.601 0.009 7594.241 5.41e+04
x21 1.571e+04 3258.754 4.822 0.000 9321.700 2.21e+04
x22 4.044e+04 1.64e+04 2.462 0.014 8226.015 7.26e+04
x23 5021.4975 7086.938 0.709 0.479 -8876.020 1.89e+04
x24 5513.9066 2422.785 2.276 0.023 762.815 1.03e+04
x25 -1.09e+04 2e+04 -0.545 0.586 -5.01e+04 2.83e+04
x26 8921.1183 8685.952 1.027 0.304 -8112.072 2.6e+04
x27 1.828e+04 4057.113 4.505 0.000 1.03e+04 2.62e+04
x28 -9030.8811 3559.594 -2.537 0.011 -1.6e+04 -2050.501
x29 7481.3685 2411.669 3.102 0.002 2752.074 1.22e+04
x30 5487.7656 3317.028 1.654 0.098 -1016.941 1.2e+04
x31 -2470.4504 2880.731 -0.858 0.391 -8119.578 3178.677
x32 -1674.6292 1.27e+04 -0.132 0.895 -2.66e+04 2.33e+04
x33 4946.7968 3778.383 1.309 0.191 -2462.628 1.24e+04
x34 -1069.9221 1.64e+04 -0.065 0.948 -3.31e+04 3.1e+04
x35 -4538.9700 6867.044 -0.661 0.509 -1.8e+04 8927.335
x36 8610.4137 2229.705 3.862 0.000 4237.952 1.3e+04
x37 -1.016e+04 9611.329 -1.058 0.290 -2.9e+04 8683.526
x38 1.702e+04 3814.113 4.463 0.000 9543.213 2.45e+04
x39 1567.8333 4262.420 0.368 0.713 -6790.791 9926.457
x40 -2361.9051 9143.105 -0.258 0.796 -2.03e+04 1.56e+04
x41 1504.9433 2431.035 0.619 0.536 -3262.327 6272.213
x42 6621.0132 6606.447 1.002 0.316 -6334.260 1.96e+04
x43 -326.6600 2612.269 -0.125 0.900 -5449.332 4796.012
x44 7778.2154 276.312 28.150 0.000 7236.366 8320.064
x45 -33.0613 44.169 -0.749 0.454 -119.677 53.555
x46 -8409.9107 1418.748 -5.928 0.000 -1.12e+04 -5627.739
==============================================================================
Omnibus: 2668.528 Durbin-Watson: 1.899
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1149034.332
Skew: 5.268 Prob(JB): 0.00
Kurtosis: 111.397 Cond. No. 1.16e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.4e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
df_cluster = df[['SALARY_FROM', 'SALARY_TO', 'MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']].dropna()
df_cluster['AVG_SALARY'] = (df_cluster['SALARY_FROM'] + df_cluster['SALARY_TO']) / 2
X_cluster = df_cluster[['AVG_SALARY', 'MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
df_cluster['Cluster'] = kmeans.fit_predict(X_scaled)
from sklearn.decomposition import PCA
import plotly.express as px
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
df_cluster['PC1'] = components[:, 0]
df_cluster['PC2'] = components[:, 1]
fig = px.scatter(
    df_cluster,
    x='PC1', y='PC2',
    color='Cluster',
    title="KMeans Clustering of Job Types",
    labels={'Cluster': 'Cluster ID'},
    opacity=0.7
)
fig.show()
This scatter plot visualizes the KMeans assignments projected onto the first two principal components. The four groups (k=4 was chosen up front rather than discovered) separate reasonably well along salary, experience, posting duration, and the AI-role flag.
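Since k=4 is an assumption here, it is worth validating the cluster count; the silhouette score is one quick check. A minimal sketch on synthetic 4-feature blobs standing in for the scaled salary/experience matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with 4 well-separated groups in 4 dimensions.
X, _ = make_blobs(n_samples=600, centers=4, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Silhouette score per candidate k: higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 4 on this synthetic data
```

Running the same loop over `X_scaled` from the notebook would confirm or challenge the choice of four clusters.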
cluster_summary = df_cluster.groupby('Cluster')[['AVG_SALARY', 'MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']].mean().round(1)
display(cluster_summary)
         AVG_SALARY  MAX_YEARS_EXPERIENCE  DURATION  IS_AI_ROLE
Cluster
0          104641.6                   4.0      47.8         0.5
1          149513.9                   6.9      20.9         0.2
2           86715.1                   2.7      19.7         1.0
3           92997.1                   2.4      17.7         0.0
Regression model

1. Model goals and features. The regression model predicts average posted salary (AVG_SALARY, the midpoint of SALARY_FROM and SALARY_TO) from structural job features, and examines the direction and strength of each feature's effect. The features actually used are remote-work type, education level, two-digit NAICS industry, maximum years of experience, posting duration, and the IS_AI_ROLE flag.

2. Model conclusions and insights. The model explains about 41% of salary variance (R² ≈ 0.41, RMSE ≈ $27,764). Maximum years of experience is the strongest positive driver (roughly +$7,800 per additional year), several industry and education dummies are significant in both directions, and posting duration has no significant effect. Notably, the IS_AI_ROLE coefficient is negative (about -$8,400), suggesting the broad keyword match pulls in many lower-paid data-adjacent roles rather than only premium AI positions. The OLS diagnostics also flag heavy right skew and severe multicollinearity from the full dummy encoding, so individual coefficients should be read with caution.

3. Advice for job seekers. Experience and education are the clearest salary levers in this model, so candidates should emphasize relevant experience and target higher-paying industries. An AI-related title alone did not command a premium here; pairing it with concrete skills such as machine learning and cloud platforms (e.g., AWS) is more likely to pay off, and remote roles in high-paying markets remain worth monitoring.

Classification model

1. Model goals and features. The classification model predicts whether a posting is AI-related (IS_AI_ROLE, derived from keywords in the job title) using remote-work type, education level, two-digit NAICS industry, and maximum years of experience; a second model adds posting duration.

2. Model conclusions and insights. The first model reached 69% accuracy, with recall of 0.45 and F1 of 0.52 for AI roles. Adding DURATION shrank the usable sample (more rows dropped for missing values) and slightly hurt performance: 66% accuracy, recall 0.39, F1 0.48 for AI roles. In both versions, class imbalance and weak features limit recall on the AI class.

3. Advice for job seekers. Candidates aiming for AI roles should use recognizably AI-related titles and keywords on their resumes and build skills in machine learning and cloud computing. Knowing which titles and keywords correlate with AI postings also helps filter listings more precisely and sharpen a job-search strategy.